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Abstract 


Real-world learning tasks often involve high-dimensional data sets with complex patterns of missing 
features. In this paper we review the problem of learning from incomplete data from two statistical 
perspectives—the likelihood-based and the Bayesian. The goal is two-fold: to place current neural net¬ 
work approaches to missing data within a statistical framework, and to describe a set of algorithms, derived 
from the likelihood-based framework, that handle clustering, classification, and function approximation 
from incomplete data in a principled and efficient manner. These algorithms are based on mixture mod¬ 
eling and make two distinct appeals to the Expectation-Maximization (EM) principle (Dempster et ah, 
1977)—both for the estimation of mixture components and for coping with the missing data. 
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1 Introduction 

In computational and biological learning the environ¬ 
ment does not often provide complete information to the 
learner. For Sample, a vision system may encounter 
many partially occluded examples of an object, yet. have 
to recover a model for the unoccluded object. Similarly, 
an adaptive controller may be required to learn a map¬ 
ping from sensor readings to actions even if its sensors 
are unreliable and sometimes fail to give readings. Ex¬ 
amples of data sets with missing values abound in ma¬ 
chine learning. 1 

In this paper we review the problem of learning from 
incomplete data from a statistical perspective. The goal 
is t.wo-fold: to place current neural network treatments 
of missing data within a statistical framework, and to 
derive from this framework a set of algorithms that han¬ 
dle incomplete data in a principled manner. To maintain 
the breadth of the review we discuss classification, func¬ 
tion approximation, and clustering problems. Because 
missing data can arise in both the input and the target 
variables, we treat both missing inputs and unlabeled 
data. 

The statistical framework that we adopt, (cf. Little 
and Rubin, 1987) makes a. distinction between the envi¬ 
ronment, which we assume to generate complete data, 
and the missing data mechanism which renders some 
of the output, of the environment, unobservable to the 
learner. The supervised learning problem consists of 
forming a map from inputs to targets. The unsuper¬ 
vised learning process generally consists of extracting 
some compact, statistical description of the inputs. In 
both these cases the learner may benefit, from knowl¬ 
edge of constraints both on the data, generation process 
(e.g., that. it. falls within a. certain parametric family), 
and on the mechanism which caused the pattern of in¬ 
completeness (e.g., that it. is independent, of the data, 
generation process). The use of statistical theory allows 
us to formalize the consequences of these constraints and 
provides us with a. framework for deriving learning algo¬ 
rithms that, make use of these consequences. 

Before developing a. framework for incomplete data, 
let. us motivate the problem with perhaps the simplest, 
statistical example that, illustrates an interaction be¬ 
tween the missing data, and the data, generation mecha¬ 
nisms. Imagine we wish to estimate the mean fi. y ) 
and covariance matrix S of a. bivariate normal distribu¬ 
tion, from a. data. set. X = {(*,;, J/i )}fLi where some of the 
observations of t/,; are missing (see Fig. 1). If we estimate 
fi. x by the mean of the Xi and y y by the mean of the ob¬ 
served values of t/,;, we will underestimate y y as we have 
ignored the covariance structure in the observed data. A 
more intelligent, heuristic would use the covariance struc¬ 
ture to fill-in the values of the missing t/,; by regressing 
them on the Xi. However, even this heuristic will yield a. 
biased estimate of the covariance matrix as the filled-in 
data, points will all fall along the regression line. Both 
of the above “filling-in” techniques, known as mean im¬ 
putation and regression imputation , yield unsatisfactory 

1 See, for example, the UCI Repository of Machine Learn¬ 
ing Databases (Murphy and Aha, 1992). 


results even on this simple parameter estimation prob¬ 
lem. 

The paper is organized as follows. In section 2 we 
outline the statistical framework that, defines the miss¬ 
ing data, and data, generation mechanisms. In section 3 
we proceed to describe a. likelihood-based approach to 
learning from incomplete data. In section 4 we use this 
approach to derive a. set of learning algorithms for func¬ 
tion approximation, classification, and clustering. Sec¬ 
tion 5 describes an alternative to the likelihood-based 
approach, the Bayesian approach, and several algorithms 
that, implement, it.. Section 6 discusses Boltzmann ma¬ 
chines and incomplete data. Finally, we conclude in sec¬ 
tion 7. 
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Figure 1: A simple example. Complete data, were gen¬ 
erated from a. Gaussian with mean (5,5) and covariance 
matrix (5/4 9/4, 9/4 17/4). Data, points with missing 
y values are denoted by hollow circles on the y = 0 
line. The solid square indicates the ( x,y ) mean calcu¬ 
lated over the observed data. The hollow square and the 
ellipse indicate the mean and standard deviation calcu¬ 
lated from the incomplete data, using a. maximum likeli¬ 
hood (ML) algorithm. Note that, the ML estimate of y y 
is higher than any of the observed values of y\ 


2 The Framework 

The statistical framework we present, is based on Lit¬ 
tle and Rubin (1987). We assume that the data. set. 
X = {x;}^ can be divided into an observed component. 
X° and a. missing component. X m . Each data, vector x,; 
may have different, patterns of missing features. We will 
not. distinguish for now between the input, and target, 
components of the data, vector. 

We formalize the notion of a. missing data, mechanism 
by defining a. missing data, indicator matrix i?, such that 

„ _ J 1, X{j observed, 

— 0, Xij missing. 



Both the data generation process and the missing data 
mechanism are considered to be random processes, with 
joint probability distribution given by 

P(X, R\9, <f>) = P(X\9)P(R\X, <f>). (1) 

We use the parameters 9 for the data generation process 
and a distinct set of parameters, <f>, for the missing data 
mechanism. 

Once the data probability model has been decom¬ 
posed as in (1), we can distinguish three nested types 
of missing data mechanism. The data can be 

1. Missing completely at 

random (MCAR): P(R\X ,<f>) = P(R\<f>). That is, 
the probability that x^ is missing is independent 
of the values in the data vector. 

2. Missing at random (MAR): P(R\X°, X m , <f>) = 

P(R\X°, (f>). That is, the probability that Xij is 
missing is independent of values of missing com¬ 
ponents of the data, but may depend on the val¬ 
ues of observed components. For example, Xij may 
be missing for certain values of provided 

that Xik is always observed. Figure 1 illustrates 
this case. 

3. Not missing at random (NMAR). That is, 
P(R \X°, X m , <f>) may depend on the value of Xij. 
If P(Rij | x^, <f>) is a function of Xij the data is said 
to be censored. For example, if a sensor fails when 
its input exceeds some range its output will be cen¬ 
sored. 

The type of the missing data mechanism is critical in 
evaluating learning algorithms for handling incomplete 
data. Full maximum likelihood and Bayesian approaches 
can handle data that is missing at random or completely 
at random. Several simpler learning approaches can han¬ 
dle MCAR data but fail on MAR data in general. No 
general approaches exist for NMAR data. 

For both Bayesian and maximum likelihood tech¬ 
niques the estimates of the parameters 9 and <f> are linked 
to the observed data, X° and R, via P(X°, R\9, <f>). 2 For 
maximum likelihood methods the likelihood is 

L(6,<f>\X°,R) cx P(X°, R\9, <f>), 
and for Bayesian methods the posterior probability is 
P(9, 4>\X°, R) (x P(X°, R\9, <f>)P(9, <f>). 

We wish to ascertain under which conditions the pa¬ 
rameters of the data generation process can be estimated 
independently of the parameters of the missing data 
mechanism. Given that 

P(X°,R\9,<f>) = J P(X°,X Tn \9)P(R\X°,X Tn ,<f>)dX Tn , 

we note that if 

P(R\X°, X m , <f>) = P(R\X°,<i>), 

2 For succinctness will use the non-Bayesian phrase “esti¬ 
mating parameters” in this section; this can be replaced by 
“calculating the posterior probabilities of the parameters” for 
the parallel Bayesian argument. 


then 

P(X°,R\9,<f>) = P(R\X°,<f>) J P(X°,X Tn \9)dX Tn 
P(R\X°,<f>)P(X°\9). (2) 

Equation (2) states that if the data is MAR then 
the likelihood can be factored. For maximum like¬ 
lihood methods this implies directly that maximizing 
L(9\X °) oc P(X°\9) as a function of 9 is equivalent to 
maximizing L(9, <f>\X°, R). Therefore the parameters of 
the missing data mechanism can be ignored for the pur¬ 
poses of estimating 9 (Little and Rubin, 1987). 

For Bayesian methods, the missing data mechanism 
cannot be ignored unless we make the additional require¬ 
ment that the prior is factorizable: 

P(0,<f>) = P(0)P(<f>). 

These results imply that data sets that are NMAR, such 
as censored data, cannot be handled by Bayesian or 
likelihood-based methods unless a model of the missing 
data mechanism is also learned. On the positive side, 
they also imply that the MAR condition, which is weaker 
than the MCAR condition, is sufficient for Bayesian or 
likelihood-based learning. 

3 Likelihood-Based Methods for 
Feedforward Networks 

In the previous section we showed that maximum likeli¬ 
hood methods can be utilized for estimating the param¬ 
eters of the data generation model, ignoring the missing 
data mechanism, provided that the data is missing at 
random. We now turn to the problem of estimating the 
parameters of a model from incomplete data. 

We focus first on feedforward neural network models 
before turning to a class of models where the missing 
data can be incorporated more naturally into the esti¬ 
mation algorithm. For feedforward neural networks we 
know that descent in the error cost function can be inter¬ 
preted as ascent in the model parameter likelihood (e.g. 
White, 1989). In particular if the target vector is as¬ 
sumed to be Gaussian, T’(y J jxi, 9) ~ IV (yy fe(*i), erf I), 
then the log likelihood is equivalent to the sum-squared 
error weighted by the output variances: 

maxlogR (yjjxj- , 9) <^> mini ^ _L ( y . _ / e (x 8 )) 2 

i i 

If a target y; is missing or unknown the variance of that 
output can be taken to be infinite, erf —>■ oo. Similarly, if 
certain components of a target vector are missing we can 
assume that the variance of that component is infinite. 
The missing targets drop out of the likelihood and the 
minimization can proceed as before, simply with certain 
targets replaced by “don’t cares.” 

If components of an input vector are missing, how¬ 
ever, then the likelihood is not properly defined since 
_P(y 8 jx 8 ', 9) depends on the full input vector. The con¬ 
ditional over the observed inputs needed for the likeli¬ 
hood requires integrating out the missing inputs. This, 
in turn, requires a model for the input density, P(x), 



which is not explicitly available in a feedforward neural 
network. 

Tresp et al. (1994) proposed solving this problem by 
separately estimating the input density, P(x), with a 
Gaussian mixture model and the conditional density, 
P(y|x), with a feedforward network. This approach can 
be seen as maximizing the joint input-output log likeli¬ 
hood 

l = l°g , y*- |6f, <^) 

i 

= Y^logP(y i \ic i ,6) + Y^logP(x i \<j>) 

i i 


where the feedforward network is parametrized by 6 and 
the mixture model is parametrized by <f>. If some com¬ 
ponents of an input vector are missing the observed data 
likelihood can be expressed as 


P(yi\x°,9) = j -P(yi|x°,x m ,6»)T’(x m |x°, (? i)dx m , 

where the input has been decomposed into its observed 
and missing components, x = (x°,x m ). The mixture 
model is used to integrate the likelihood over the missing 
inputs of the feedforward network. The gradient of this 
likelihood with respect to the network parameters, 
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exhibits error terms which weight each completion of the 
missing input vector by _P(x m |y 8 ', x°, 9, <f>). This term, by 
Bayes rule, is proportional to the product of the prob¬ 
ability of the completion given the input, T’(x m |x°, <f>), 
and the posterior probability of the output given that 
completion _P(y 8 jx°, x m , 6). The integral can be approx¬ 
imated by a Monte Carlo method, where, for each miss¬ 
ing input, several completions are generated according 
to the input distribution. An intuitively appealing as¬ 
pect of this method is that more weight is placed on 
error gradients from input completions that better ap¬ 
proximate the target (Tresp et ah, 1994; Buntine and 
Weigend, 1991). 

These arguments imply that computing maximum 
likelihood estimates from missing inputs requires a 
model of the joint input density. In principle this could 
be achieved by multiple feedforward networks each learn¬ 
ing a particular conditional density of inputs. For exam¬ 
ple, if the pattern of missing inputs is monotone, i.e. the 
d input dimensions can be ordered such that if Xij is 
observed then all Xij. for k < j are also observed, then 
the missing data can be completed by a cascade of d — 1 
networks. Each network is trained to predict one in¬ 
put dimension from completed instances of all the lower 
index input dimensions and therefore models that par¬ 
ticular conditional density (cf. regression imputation for 
monotone multivariate normal data; Little and Rubin, 
1987). 

However, to accommodate general patterns of miss¬ 
ing inputs and targets the approach of using multiple 

O 


feedforward networks becomes practically cumbersome 
as the number of such networks grows exponentially with 
the data dimensionality. This problem can be avoided 
by modeling both the input and output densities using 
a mixture model. 

4 Mixture Models and Incomplete Data 

The mixture modeling framework allows learning from 
data sets with arbitrary patterns of incompleteness. 
Learning in this framework is a classical estimation prob¬ 
lem requiring an explicit probabilistic model and an al¬ 
gorithm for estimating the parameters of the model. A 
possible disadvantage of parametric methods is their lack 
of flexibility when compared with nonparametric meth¬ 
ods. Mixture models, however, largely circumvent this 
problem as they combine much of the flexibility of non¬ 
parametric methods with certain of the analytic advan¬ 
tages of parametric methods (McLachlan and Basford, 
1988). 

Mixture models have been utilized recently for super¬ 
vised learning problems in the form of the “mixtures of 
experts” architecture (Jacobs et ah, 1991; Jordan and 
Jacobs, 1994). This architecture is a parametric re¬ 
gression model with a modular structure similar to the 
nonparametric decision tree and adaptive spline models 
(Breiman et ah, 1984; Friedman, 1991). The approach 
presented here differs from these regression-based ap¬ 
proaches in that the goal of learning is to estimate the 
density of the data. No distinction is made between in¬ 
put and output variables; the joint density is estimated 
and this estimate is then used to form an input/output 
map. Similar density estimation approaches have been 
discussed by Specht (1991) for nonparametric models, 
and Nowlan (1991) and Tresp et al. (1994), among oth¬ 
ers, for Gaussian mixture models. To estimate the vec¬ 
tor function y = /(x) the joint density P(n, y) is esti¬ 
mated and, given a particular input x, the conditional 
density P(y|x) is formed. To obtain a single estimate of 
y rather than the full conditional density one can evalu¬ 
ate y = if(y|x), the expectation of y given x. 

The most appealing feature of mixture models in the 
context of this paper is that they can deal naturally with 
incomplete data. In fact, the problem of estimating mix¬ 
ture densities can itself be viewed as a missing data prob¬ 
lem (the “labels” for the component densities are miss¬ 
ing) and an Expectation-Maximization (EM) algorithm 
(Dempster et al., 1977) can be developed to handle both 
kinds of missing data. 

4.1 The EM algorithm for mixture models 

This section outlines the estimation algorithm for find¬ 
ing the maximum likelihood parameters of a mixture 
model (Dempster et al., 1977). We model the data 
X = {xj-}))^ as being generated independently from a 
mixture density 

M 

P( x i) = (3) 

j =i 

where each component of the mixture is denoted uij and 
parametrized by 6j. We start by assuming complete 



data. From equation (3) and the independence assump¬ 
tion we see that the log likelihood of the parameters 
given the data set is 

N M 

K e \ x ) = E log E p ( x *l w J ; ^) P( E')- 

i= 1 j=l 

By the maximum likelihood principle the best model of 
the data has parameters that maximize l(9\X). This 
function, however, is not easily maximized numerically 
because it involves the log of a sum. 

Intuitively, there is a “credit-assignment” problem: 
it is not clear which component of the mixture gener¬ 
ated a given data point and thus which parameters to 
adjust to fit that data point. The EM algorithm for 
mixture models is an iterative method for solving this 
credit-assignment problem. The intuition is that if one 
had access to a “hidden” random variable z indicating 
which data point was generated by which component, 
then the overall maximization problem would decou¬ 
ple into a set of simple maximizations. Using the bi¬ 
nary indicator variables Z = {z;}^, defined such that 
z i = (zn , . . ., Z{m) and Zij = 1 iff x; is generated by 
Gaussian j, a “complete-data” log likelihood function 
can be written 

N M 

l c (6\X,Z) = EE Zij log[-P(xjjz 8 '; 6)P(zi] 6)\, (4) 

i= ij=i 

which does not involve a log of a summation. 

Since z is unknown l c cannot be utilized directly, so we 
instead work with its expectation, denoted by Q(9\8 k ). 
As shown by (Dempster et ah, 1977), 1(8\X) can be max¬ 
imized by iterating the following two steps: 

E-step: Q(9\9 k ) = E[l c (9\X , Z)\X , 8 k \ 

M-step: 9 k+ i = argmax Q(9\9 k ). (5) 

9 

The Expectation or E-step computes the expected com¬ 
plete data log likelihood, and the Maximization or M- 
step finds the parameters that maximize this likelihood. 
In practice, for densities from the exponential family the 
E-step reduces to computing the expectation over the 
missing data of the sufficient statistics required for the 
M-step. These two steps form the basis of the EM algo¬ 
rithm for mixture modeling. 

4.1.1 Incorporating missing values into the EM 
algorithm 

In the previous section we presented one aspect of the 
EM algorithm: learning mixture models. Another im¬ 
portant application of EM is to learning from data sets 
with missing values (Little and Rubin, 1987; Dempster 
et ah, 1977). This application has been pursued in the 
statistics literature mostly for non-mixture density es¬ 
timation problems. 3 We now show how combining the 

3 Some exceptions are the use of mixture densities in the 
context of contaminated normal models for robust estima¬ 
tion (Little and Rubin, 1987), and in the context of mixed 
categorical and continuous data with missing values (Little 
and Schluchter, 1985). 


missing data application of EM with that of learning 
mixture parameters results in a set of clustering, classi¬ 
fication, and function approximation algorithms for in¬ 
complete data. 

Using the previously defined notation, x 8 - is divided 
into (x°,x™) where each data vector can have different 
patterns of missing components. (To denote the missing 
and observed components in each data vector we would 
ordinarily introduce superscripts m 8 and o;, however, we 
have simplified the notation for the sake of clarity.) 

To handle missing data we rewrite the EM algorithm 
incorporating both the indicator variables from algo¬ 
rithm (5) and the missing inputs, X m . 

E-step: Q(9\9 k ) = E[l c (9\X°, T m , Z)\X°, 8 k ] 

M-step: 9 k+ i = argmax Q(9\8 k ). 


The expected value in the E-step is taken with respect 
to both sets of missing variables. We proceed to illus¬ 
trate this algorithm for two classes of models, mixtures 
of Gaussians and mixtures of Bernoullis, which we later 
use as building blocks for classification and function ap¬ 
proximation. 


4.1.2 Real-valued data: mixture of Gaussians 

Real-valued data can be modeled as a mixture of 
Gaussians. We start with the estimation algorithm for 
complete data (Duda and Hart, 1973; Dempster et ah, 
1977; Nowlan, 1991). For this model the E-step simpli¬ 
fies to computing E[zij\xi, 8 k \. Given the binary nature 
of Zij, E[zij\xi, 8 k \, which we denote by hij, is the prob¬ 
ability that Gaussian j generated data point i. 

h _ IE |~ 1/2 exp{—7, (x 8 - - fi j ) T 'E,j 1 (x i - jij)} 

13 Efci |E|“ 1/2 exp{-±(x; - pt ; ) T SE(xi - fi { )} 

(8) 

The M-step re-estimates the means and covariances of 
the Gaussians 4 using the data set weighted by the hij: 


j 

-fc + 1 _ 2^8 = 1 n »’j X i 

r? - , 

l^i = 1 n ij 
k + l\ 


(7) 


** +1 EL%(x8-Rr i )(^-^r i ) 5 (o, 

E - h ' ° 

l^i = 1 n ij 

To incorporate missing data we begin by rewriting the 
log likelihood of the complete data, 


A + 1 \T 


l c (9\X°,X m ,Z) 


N M 

EE Zij log-P(Xjjzi) 9 ) + 


* J 
N M 


EE Zij log P(zi\8). 

i j 


(9) 


We can ignore the second term since we will only be es¬ 
timating the parameters of the P(x;|z;, 0). Specializing 
equation (9) to the mixture of Gaussians we note that 

4 Though this derivation assumes equal priors for the 
Gaussians, if the priors are viewed as mixing parameters they 
can also be learned in the maximization step. 



if only the indicator variables z; are missing, the E step 
can be reduced to estimating E[zij\xi, 6] as before. For 
the case we are interested in, with both z; and x™ miss¬ 
ing, we expand equation (9) using m and o superscripts 
to denote subvectors and submatrices of the parameters 
matching the missing and observed components of the 
data, 5 to obtain 

N M 

l c (e\X°,X m ,Z) = ^^z 8 J [|lo g 27 r+-log|E J | 

* 3 

-^(x?-^) T S7 1 ’ 00 (x?-^) 


Note that after taking the expectation, the sufh- 
cient statistics for the parameters include three un¬ 
known terms, z;j, Zy-x™, and z 8 jX™x™ t . Thus 
we must compute: E[zij\x.°, 6^], -E , [z J jX™|x°, Ok], and 
ff[z SJ x“x“ T |x°,^]. 

One intuitive approach to dealing with missing data 
is to use the current estimate of the data density to com¬ 
pute the expectation of the missing data in an E-step, 
complete the data with these expectations, and then use 
this completed data to re-estimate parameters in an M- 
step. However, as we have seen in section 1, this intuition 
fails even when dealing with a single two-dimensional 
Gaussian; the expectation of the missing data always lies 
along a line, which biases the estimate of the covariance. 
On the other hand, the approach arising from applica¬ 
tion of the EM algorithm specifies that one should use 
the current density estimate to compute the expectation 
of whatever incomplete terms appear in the likelihood 
maximization. For the mixture of Gaussians these in¬ 
complete terms are the interactions between the indica¬ 
tor variable z 8 y and the first and second moments of x™. 
Thus, simply computing the expectation of the missing 
data z i and x™ from the model and substituting those 
values into the M step is not sufficient to guarantee an 
increase in the likelihood of the parameters. 

To compute the above expectations we define 


= E[x?\ Zij = 1,X°, 0 h ] = M, m + S. mo E°°- (x° - rt), 


which is the least-squares linear regression between x™ 
and x° predicted by Gaussian j. Then, the first expec¬ 
tation is E[zij\x.°, 6k] = hij, the probability as defined 
in (6) measured only on the observed dimensions of x;. 
Similarly, we get 


E[zijiti 


- h-x 1 " 


5 For example, E is divided into 


E 

E 


O O 


mo 


E° m 

m m 


corre- 


sponding to x 



. Also note that the superscript 


( — 1, oo) denotes inverse followed by submatrix operations, 
whereas (oo -1 ) denotes the reverse order. 


and 


E[z k 


,m„mT | 


a] = m s“ m - s r s r 


ymoT , -m-mT 


( 10 ) 


The M-step uses these expectations substituted into 
equations (7) and (8) to re-estimate the means and co- 
variances. To re-estimate the mean vector, fij, we sub¬ 
stitute the values of x- for the missing components of 
x; in equation (7). To re-estimate the covariance matrix 
we substitute the values of the bracketed term in (10) for 
the outer product matrices involving the missing compo¬ 
nents of x; in equation (8). 


4.1.3 Discrete-valued data: mixture of 
Bernoullis 


Binary data can be modeled as a mixture of Bernoulli 
densities. That is, each D-dimensional vector x = 
(xi, . . ., Xd, . . . Xu), Xd G {0, 1}, is modeled as generated 
from the mixture of M Bernoulli densities: 

M D 

P(x\e) = J2 P(w) n 

j =1 d =1 


For this model the complete data E-step computes 


hij — 


Ud=i ^(i 


and the M-step re-estimates the parameters by 

y N h--x- 


- k + 1 

Pi = 


eZi k 


(ii) 


( 12 ) 


As before, to incorporate missing data we must 
compute the appropriate expectations of the sufficient 
statistics in the E-step. For the Bernoulli mixture 
these include the incomplete terms E[zij |x°, 8k] and 
£ , [zjjX™|x°, Ok]- The first is equal to hij calculated over 
the observed sub vector of x;. The second, since we as¬ 
sume that within a class the individual dimensions of the 
Bernoulli variable are independent, is simply hijfij 1 . The 
M-step uses these expectations substituted into equa¬ 
tion (12). 

More generally, discrete or categorical data can be 
modeled as generated by a mixture of multinomial den¬ 
sities and similar derivations for the learning algorithm 
can be applied. Finally, the extension to data with mixed 
real, binary, and categorical dimensions can be readily 
derived by assuming a joint density with mixed compo¬ 
nents of the three types. Such mixed models can serve 
to solve classification problems, as will be discussed in a 
later section. 


4.2 Clustering 

Gaussian mixture model estimation is a form of soft clus¬ 
tering (Nowlan, 1991). Furthermore, if a full covariance 
model is used the principal axes of the Gaussians align 
with the principal components of the data within each 
soft cluster. For binary or categorical data soft clus¬ 
tering algorithms can also be obtained using the above 
Bernoulli and multinomial mixture models. We illus¬ 
trate the extension of these clustering algorithms to miss¬ 
ing data problems with a simple example from character 
recognition. 
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Figure 2: Learning digit patterns. First row: the ten 5x7 templates used to generate the data set* Second row: 
templates with Gaussian noise added. Third row: templates with noise added and 50%) missing pixels. The training 
set consisted of ten such noisy, incomplete samples of each digit. Fourth row: means of the twelve Gaussians at 
asymptote (~ 30 passes through the data set of 100 patterns) using the mean imputation heuristic. Fifth row: 
means of the twelve Gaussians at asymptote (~ 60 passes, same incomplete data set) using the EM algorithm. 
Gaussians constrained to diagonal covariance matrices. 


In this example (Fig. 2), the Gaussian mixture algo¬ 
rithm was used on a training set of 100 35-dinrensional 
noisy greyscale digits with 50%) of the pixels missing. 
The EM algorithm approximated the cluster means from 
this highly deficient data set quite well. We compared 
EM to mean imputation, a common heuristic where 
the missing values are replaced with their unconditional 
means. The results showed that EM outperformed mean 
imputation when measured both by the distance be¬ 
tween the Gaussian means and the templates (see Fig. 2), 
and by the likelihoods (log likelihoods ± 1 s.e.: EM 
—4633 ± 328; mean imputation —10062 ± 1263; n = 5). 

4.3 Function approximation 

So far, we have alluded to data vectors with no refer¬ 
ence to “inputs” and “targets.” In supervised learning, 
however, we generally wish to predict target variables 
from some set of input variables—that is, we wish to ap¬ 
proximate a function relating these two sets of variables. 
If we decompose each data vector x,; into an “input” 
subvector, x), and a “target” or output sub vector, xj, 
then the relation between input and target variables can 
be expressed through the conditional density _P(xj|x) : ). 
This conditional density can be readily obtained from 
the joint input/target density, which is the density which 
all the above mixture models seek to estimate. Thus, 
in this framework, the distinction between supervised 
learning, i.e. function approximation, and unsupervised 
learning, i.e. density Estimation, is semantic, resulting 
from whether the data is considered to be composed of 
inputs and targets or not. 

Focusing on the Gaussian mixture model we note that 


the conditional density _P(xj|x) : ) is also a Gaussian mix¬ 
ture. Given a particular input the estimated output 
should summarize this density. 

If we require a single estimate of the output, a natu¬ 
ral candidate is the least squares estimate (LSE), which 
takes the formx t (x) : ) = ^(xjlxj). Expanding the expec¬ 
tation we get 

M 

*(*) = £ %itfj + 1 ( :x i - Rj)]- (13) 

j =i 

which is a convex sum of the least squares linear approx¬ 
imations given by each Gaussian. The weights in the 
sum, hij , vary nonlinearly over the input space and can 
be viewed as corresponding to the output of a classifier 
that assigns to each point in the input space a probability 
of belonging to each Gaussian. b The least squares esti¬ 
mator has interesting relations to models such as CART 
(Breiman et ah, 1984), MARS (Friedman, 1991), and 
mixtures of experts (Jacobs et ah, 1991; Jordan and 
Jacobs, 1994), in that the mixture of Gaussians com¬ 
petitively partitions the input space, and learns a linear 
regression surface on each partition. This similarity has 
also been noted by Tresp et al. (1994). 

If the Gaussian covariance matrices are constrained to 
be diagonal, the least squares estimate further simplifies 
to 

M 

|=i 

6 The hij in equation (13) are computed by substituting 
x) into equation (6) and evaluating the exponentials over the 
dimensions of the input space. 



the average of the output means, weighted by the prox¬ 
imity of x) to the Gaussian input means. This expression 
has a form identical to normalized radial basis function 
(RBF) networks (Moody and Darken, 1989; Poggio and 
Girosi, 1989), although the two algorithms are derived 
from disparate frameworks. In the limit, as the covari¬ 
ance matrices of the Gaussians approach zero, the ap¬ 
proximation becomes a nearest-neighbor map. 

Not all learning problems lend themselves to least 
squares estimates—many problems involve learning a 
one-to-many mapping between the input and target vari¬ 
ables (Ghahramani, 1994). The resulting conditional 
densities are multimodal and no single value of the 
output given the input will appropriately reflect this 
fact (Shizawa, 1993; Ghahramani, 1994; Bishop, 1994). 
For such problems a stochastic estimator, where the out¬ 
put is sampled according to x t (x)) ~ T’(x*|x)), is to be 
preferred to the least squares estimator. 

For learning problems involving discrete variables the 
LSE and stochastic estimators have a different interpre¬ 
tation. If we wish to obtain the posterior probability of 
the output given the input we would use the LSE esti¬ 
mator. On the other hand, if we wish to obtain output 
estimates that fall in our discrete output space we would 
use the stochastic estimator. 

4.4 Classification 

Classification with missing inputs 
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Figure 3: Classification of the iris data set. 100 data 
points were used for training and 50 for testing. Each 
data point consisted of 4 real-valued attributes and one 
of three class labels. The figure shows classification per¬ 
formance ± 1 standard error (n = 5) as a function of 
proportion missing features for the EM algorithm and 
for mean imputation (MI), a common heuristic where 
the missing values are replaced with their unconditional 
means. 


Classification, though strictly speaking a special case 
of function approximation, merits attention of its own. 
Classification problems involve learning a mapping from 
an input space of attributes into a set of discrete class 


labels. The mixture modeling framework presented here 
lends itself readily to classification problems by modeling 
the class label as a multinomial variable. For example, 
if the attributes are real-valued and there are D class 
labels, a mixture model with Gaussian and multinomial 
components can be used; 


M 


P(x, C = d\6) = ^FK ) 
i=i 


/jjd 


( 2 tt )”/ 2 1 Sj 11/ 2 


ex P{—2( x_ ^i) Ts i 1 ( x ~ Pj)} 


denotes the joint probability that the data point has at¬ 
tributes x and belongs to class d, where the /j.jd are the 
parameters for the multinomial class variable. That is, 

Hjd = P(C = d\uij,6), and J2d=i Nd = 1- 

Missing attributes and missing class labels (i.e., unla¬ 
beled data points) are readily handled via the EM algo¬ 
rithm. In the E-step, missing attributes are completed 
using the same formulas as for the Gaussian mixture ex¬ 
cept. that 


hij = F(x“, C{ = d\uij, 9) 


^jdP(x°\uij,9) 
Ef=i /hd-P(x?|w,,0) 


On the other hand, if a class label is missing hij becomes 
P(xi\uij, 0)/ Ef=i -P(x,;|w/, 9), exactly as in the Gaussian 
mixture. The class label is then completed with a prob¬ 
ability vector whose component is Ejli hijl^jd- 

Once the classification model has been estimated, 
the most likely label for a particular input x may be 
obtained by computing P(C = c/|x, 0). Similarly, the 
class conditional densities can be computed by evaluat¬ 
ing .P(x|C = d,6). Conditionalizing over classes in this 
way yields class conditional densities which are in turn 
mixtures of Gaussians. Figure 3 shows the performance 
of the EM algorithm on a sample classification problem 
with varying proportions of missing features. 

This mixture-based approach to classification is 
closely related to the mixture discriminant analysis 
(MDA) approach recently proposed by Ha.st.ie and Tib- 
shirani (1994). In MDA, classes are also fit. by mixture 
densities using the EM algorithm and an optimal dis¬ 
criminant. is obtained. Ha.st.ie and Tibshira.ni extend 
this basic MDA procedure by combining it. with reduced 
rank discrimination. Like Fisher-Ra.o linear discriminant, 
analysis this results in an interpretable, low dimensional 
projection of the data, and often also leads to improved 
classification performance. While the authors do not. 
mention missing data, it. seems likely that EM methods 
can be used in thjjejcontext of their algorithm. 

Previous approaches to classification from incomplete 
patterns have proceeded along different, lines. Cheese- 
man et. a.l. (1988) describe a. Bayesian classification 
method in which each class is modeled as having Gaus¬ 
sian real-valued attributes and multinomial discrete at¬ 
tributes. The learning procedure finds the maximum a 
posteriori parameters of the model by differentiating the 
posterior probability of the class parameters and setting 
to zero. This yields a. coupled set. of nonlinear equations, 
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similar to the EM steps, which can be iterated to find 
the posterior mode of the parameters (Dempster et ah, 
1977). To handle missing data the authors state that 
“for discrete attributes it can be shown that the correct 
procedure for treating an unknown value is equivalent to 
adding an ‘unknown’ category to the value set” (p. 62). 
For real-valued attributes they add a ‘known’/‘unknown’ 
category to each attribute and set its value appropri¬ 
ately. Three comments can be made about this ap¬ 
proach. First, each ‘unknown’ category added to the 
multinomial value set results in an extra parameter that 
has to be estimated. Furthermore, adding an ‘unknown’ 
category does not reflect the fact that the unobserved 
data actually arises from the original multinomial value 
set (an argument also made by Quinlan, 1986; see be¬ 
low). For example, for a data set in which one attribute 
is often unknown the algorithm may form a class based 
on that attribute taking on the value ‘unknown’—a situ¬ 
ation which is clearly undesirable in a classifier. Finally, 
as each class is modeled by a single Gaussian or multino¬ 
mial and the data points are assumed to be unlabeled, 
the Cheeseman et al. (1988) algorithm is in fact a form 
of soft clustering. 

Southcott and Bogner (1993) have approached the 
problem of classification of incomplete data using an ap¬ 
proximation to EM for clustering. In the E-step, the 
observed data are classified using the current mixture 
model, and each data point is assigned to its most likely 
class. The parameters of each class are then re-estimated 
in the M-step. In our notation this approximation corre¬ 
sponds to setting the highest hij for each data point to 
1 and all the others to 0. They compared this method 
with a neural network based algorithm in which each 
missing input is varied through the possible range of 
(discrete) attribute values to find the completion result¬ 
ing in minimum classification error. They reported that 
their approximation to EM outperformed both the neu¬ 
ral network algorithm and an algorithm based on linear 
discriminant analysis. They did not include the exact 
EM algorithm in their comparison. 

Quinlan (1986,1989) discusses the problem of missing 
data in the context of decision tree classifiers. Quinlan’s 
decision tree framework uses a measure of information 
gain to build a classifier, resulting in a tree structure 
of queries on attribute values and a set of leaves rep¬ 
resenting class membership. The author concludes that 
treating ‘unknown’ as a separate value is not a good so¬ 
lution to the missing value problem, as querying on at¬ 
tributes with unknown values will have higher apparent 
information gain (Quinlan, 1986). The approach that 
he advocates instead is to compute the expected infor¬ 
mation gain, by assuming that the unknown attribute is 
distributed according to the observed values in the sub¬ 
set of the data at that node of the tree. This approach 
is consistent with the information theoretic framework 
adopted in his work and parallels the EM and Bayesian 
treatments of missing data which suggest integrating 
over the possible missing values. 

An alternative method of handling missing data in de¬ 
cision trees is presented by Breiman et al. (1984) for the 
CART algorithm. CART initially constructs a large de¬ 


cision tree based on a splitting criterion closely related to 
the above measure of information gain. The tree is then 
pruned recursively using a measure of model complexity 
proportional to the number of terminal nodes, resulting 
in a smaller, more interpretable tree with better gener¬ 
alization properties. If a case is missing the value of an 
attribute then it is not considered when evaluating the 
goodness of splits on that attribute. Cases are assigned 
to branches of a split on an attribute where they have 
missing values using the best ‘surrogate split’—i.e. the 
split on another attribute which partitions the data most 
similarly to the original split. This method works well 
when there is a single, highly correlated attribute that 
predicts the effects of a split along the missing attribute. 
However, if no single attribute can predict the effects of 
the split this method may not perform well. An approach 
based on computing the expected split from all the ob¬ 
served variables, similar to Quinlan’s, would be more 
suitable from a statistical perspective and may provide 
improved performance with missing data. 

5 Bayesian methods 

In Bayesian learning the parameters are treated as un¬ 
known random variables characterized by a probability 
distribution. Bayesian learning utilizes a prior distribu¬ 
tion for the parameters, which may encode world knowl¬ 
edge, initial biases of the learner, or constraints on the 
probable parameter values. Learning proceeds by Bayes’ 
rule—multiplying the prior probability of the parameters 
by the likelihood of the data given the parameters, and 
normalizing by the integral over the parameter space— 
resulting in a posterior distribution of the parameters. 
The information learned about the unknown parameters 
is expressed in the form of this posterior probability dis¬ 
tribution. 

In the context of learning from incomplete data, the 
Bayesian use of priors can have impact in two arenas. 
First, the prior may reflect assumptions about the initial 
distribution of parameter values as described above. The 
learning procedure converts this prior into a posterior via 
the data likelihood. We have seen that to perform this 
conversion independently of the missing data mechanism 
requires both that the mechanism be missing at random 
and that the prior be factorizable. Second, the prior 
may reflect assumptions about the initial distribution of 
the missing values. Thus, if we have a prior distribution 
for input values we can complete the missing data by 
sampling from this distribution. 

For complete data problems and simple models the 
judicious choice of conjugate priors for the parameters 
often allows analytic computation of their posterior dis¬ 
tribution (Box and Tiao, 1973). However, in incomplete 
data problems the usual choices of conjugate priors do 
not generally lead to recognizable posteriors, making it¬ 
erative simulation and sampling techniques for obtaining 
the posterior distribution indispensable (Schafer, 1994). 

5.1 Data augmentation and Gibbs sampling 

One such technique, which is closely related in form to 
the EM algorithm, is data augmentation (Tanner and 
Wong, 1987). This iterative algorithm consists of two 



steps. In the Imputation or I-step, instead of comput¬ 
ing the expectations of the missing sufficient statistics, 
we simulate m random draws of the missing data from 
their conditional distribution f 3 (fk’ m |fk’°, 9). In the Pos¬ 
terior or P-step we sample m times from the posterior 
distribution of the parameters, which can now be more 
easily computed with the imputed data: P(9\X°, th m ). 
Thus, we obtain samples from the joint distribution of 
T’(T’ m , 9\X°) by alternately conditioning on one or the 
other of the unknown variables, a technique known as 
Gibbs sampling (Geman and Geman, 1984). Under some 
mild regularity conditions this algorithm can be shown 
to converge in distribution to the posterior (Tanner and 
Wong, 1987). Note that the augmented data can be cho¬ 
sen so as to simplify the P-step in much the same way as 
indicator variables can be chosen to simplify the M-step 
in EM. 

Data aug¬ 

mentation techniques have been recently combined with 
the Metropolis-Hastings algorithm (Schafer, 1994). In 
Metropolis-Hastings (Metropolis et ah, 1953; Hastings, 
1970), one creates a Monte Carlo Markov chain by draw¬ 
ing from a probability distribution meant to approximate 
the distribution of interest and accepting or rejecting the 
drawn value based on an acceptance ratio. The accep¬ 
tance ratio, e.g. the ratio of probabilities of the drawn 
state and the previous state, can often be chosen to be 
easy to calculate as it does not involve computation of 
the normalization factor. If the transition probabilities 
allow any state to be reached eventually from any other 
state (i.e. the chain is ergodic) then the Markov chain 
will approach its stationary distribution, chosen to be 
the distribution of interest, from any initial distribution. 
The combination of data augmentation and Metropolis- 
Hastings can be used, for example, in problems where 
the posterior itself is difficult to sample from in the P- 
step. For such problems one may generate a Markov 
chain whose stationary distribution is P(9\X°,X m ). 

5.2 Multiple imputation and Bayesian 
backpropagation 

Multiple imputation (Rubin, 1987) is a technique in 
which each missing value is replaced by m simulated val¬ 
ues which reflect uncertainty about the true value of the 
missing data. After multiple imputation, m completed 
data sets exist, each of which can be analyzed using com¬ 
plete data methods. The results can then be combined 
to form a single inference. Though multiple imputation 
requires sampling from T’(T’ m \X°, 9), which may be dif¬ 
ficult , iterative simulation methods can also be used in 
this context (Schafer, 1994). 

The Bayesian backpropagation technique for missing 
data presented by Buntine and Weigend (1991) is a spe¬ 
cial case of multiple imputation. In Bayesian backpropa¬ 
gation, multiple values of the input are imputed accord¬ 
ing to a prior distribution so as to approximate the inte¬ 
gral in (3), which in turn is used to compute the gradient 
required for backpropagation. This procedure is similar 
to that of Tresp et al. (1994), except that whereas the 
former completes the data by sampling from a prior dis¬ 
tribution of inputs, the latter estimates this distribution 


directly from the data. 7 

6 Boltzmann machines and incomplete 
data 

Boltzmann machines are networks of binary stochastic 
units with symmetric connections, in which learning cor¬ 
responds to minimizing the relative entropy between the 
probability distribution of the visible states and a target 
distribution (Hinton and Sejnowski, 1986). The relative 
entropy cost function can be rewritten to reveal that, 
if the target distribution is taken to be the empirical 
distribution of the data, it is equivalent to the model 
likelihood. Therefore, the Boltzmann learning rule im¬ 
plements maximum likelihood density estimation over 
binary variables. 

The Boltzmann learning procedure first estimates cor¬ 
relations between unit activities in a stage where both 
input and target units are clamped and in a stage where 
the target units are unclamped. These correlations are 
then used to modify the parameters of the network in 
the direction of the relative entropy cost gradient. This 
moves the output unit distribution in the unclamped 
phase closer to the target distribution in the clamped 
phase. 

Reformulated in terms of maximum likelihood condi¬ 
tional density estimation, the Boltzmann learning rule 
is an instance of the generalized EM algorithm (GEM; 
Dempster, Laird, and Rubin, 1977): the estimation of 
the unit correlations given the current weights and the 
clamped values corresponds to the E-step, and the up¬ 
date of the weights corresponds to the M-step (Hinton 
and Sejnowski, 1986). It is generalized EM in the sense 
that the M-step does not actually maximize the likeli¬ 
hood but simply increases it by gradient ascent. 

The incomplete variables in the Boltzmann machine 
are the states of the hidden units—those that are not 
denoted as the visible input or output units. This sug¬ 
gests that the principled way of handling missing inputs 
or targets in a Boltzmann machine is to treat them as 
hidden units, that is, to leave them unclamped. Ex¬ 
actly as in the formulation for mixture models presented 
above, the EM algorithm will then estimate the appro¬ 
priate sufficient statistics—the first order correlations— 
in the E-step. These sufficient statistics will then be used 
to increase the model likelihood in the M-step. 

7 Conclusions 

There are several ways of handling missing data dur¬ 
ing learning. Heuristics, such as Riling in the missing 
data with unconditional or conditional means, are not al¬ 
ways efficient, discarding information latent in the data 
set. More principled statistical approaches yield inter¬ 
pretable results, providing a guarantee to find the max¬ 
imum likelihood parameters despite the missing data. 

These statistical approaches argue convincingly that 
the missing data has to be integrated out using an esti¬ 
mate of the data density. One class of models in which 

7 From a strictly Bayesian point of view both procedures 
are improper in that they don’t take into account the vari¬ 
ability of the parameters in the integration. 



this can be performed naturally and efficiently are mix¬ 
ture models. For these models, we have described appli¬ 
cations to clustering, function approximation, and clas¬ 
sification from real and discrete data. In particular, we 
have shown how missing inputs and targets can be incor¬ 
porated into the mixture model framework—essentially 
by making a dual use of the ubiquitous EM algorithm. 

Finally, our principal conclusion is that virtually all of 
the incomplete data techniques reviewed from the neural 
network and machine learning literatures can be placed 
within this basic statistical framework. 
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